
Abstract

This project is an effort towards partial fulfilment of the requirements for Udacity’s Data Analyst Nanodegree.

The purpose is to perform an Exploratory Data Analysis (mono-, bi- and multivariate) on a dataset containing physicochemical measurements and tasting results of a sample of red wines.

Goal of the data analysis

Our goal with this dataset is to investigate how the chemical properties of a wine affect its perceived quality. Ideally, we would be able to come up with a regression model that enables us to predict the quality of a wine given its chemical properties.

As the authors of the dataset mention in their notes, it is possible that there are correlations between some of the measured quantities. Therefore, in the course of our work we will try to detect any such interactions between the variables.

Dataset loading and some preliminary cleaning

We start by loading the dataset (stored in a CSV file) into a data.frame:

Some information about the number and the structure of the observations

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

We see that there are 1599 observations of wines, each one containing 13 variables. We can safely remove the “X” column as it simply repeats the natural index of the observations.

The rest of the variables are numeric. This is valid for most of them, since they are results of laboratory measurements. However, the “quality” variable is the tasting result, which is a categorical variable with an ordinal scale. Therefore, we can convert this variable to an ordered factor, assuming that a higher number indicates a higher quality.
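
A minimal sketch of these two clean-up steps, assuming the data frame is called w (the name that appears in the polr call later in this report). To stay self-contained it rebuilds the first ten observations from the str() output above, keeping only two of the measurement columns, instead of reading the CSV file. The explicit levels = 3:8 is an assumption chosen to match the six quality levels reported in this report.

```r
# First ten observations, copied from the str() output (subset of columns).
w <- data.frame(
  X       = 1:10,
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the redundant row-index column.
w$X <- NULL

# Convert quality to an ordered factor; levels are given explicitly so that
# all six grades exist even if a subset of rows does not contain them.
w$quality <- factor(w$quality, levels = 3:8, ordered = TRUE)

str(w)
levels(w$quality)
```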

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...

In this last “str” output we clearly see that “quality” is now represented by a 6-level ordered factor with the following levels:

## [1] "3" "4" "5" "6" "7" "8"

We test the data for missing (NA) or not-a-number (NaN) values; the check returns TRUE, i.e. there are no such values.
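
The exact check used is not shown in this report; one possible base-R form is sketched below. The small data frame holds the first three pH and alcohol values from the str() output and only stands in for the full dataset.

```r
# First three observations of two columns, standing in for the full data.
w <- data.frame(pH = c(3.51, 3.20, 3.26), alcohol = c(9.4, 9.8, 9.8))

# is.na() flags both NA and NaN, so one test covers both kinds of bad values.
no_bad_values <- !any(is.na(w))
print(no_bad_values)

# The same check turns FALSE as soon as a single NaN is introduced.
w_bad <- w
w_bad$pH[2] <- NaN
print(!any(is.na(w_bad)))
```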

Otherwise, the dataset is already in a “tidy” state–a row per observation, with all observation attributes in separate columns.

Univariate Plots Section

In this section we perform an initial statistical and graphical analysis of the variables contained in the dataset.

Fixed acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed acidity has a roughly normal distribution with some right skew (skewness = 0.9809084) and a relatively long tail (excess kurtosis = 1.1196987).
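
The report does not say which function produced the skewness and kurtosis figures. The sketch below uses the plain moment-based definitions in base R, which is an assumption; packages may apply small-sample corrections, so their results can differ slightly. The input is the first ten fixed acidity values from the str() output, so the numbers will not match the full-sample figures quoted above.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3
# (so that a normal distribution scores 0).
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
print(skewness(fixed_acidity))
print(excess_kurtosis(fixed_acidity))
```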

The boxplot identifies many values as outliers.

Volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The volatile.acidity variable has a similar right-skewed distribution (skewness = 0.6703331) and a similarly long tail (excess kurtosis = 1.2126893).

The distribution seems a bit multi-modal. We can see this on a higher-resolution histogram.

The boxplot again identifies several values as outliers.

Citric acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The citric.acid variable has many values equal to 0, as well as one value equal to 1 (also shown by the boxplot as an outlier). This outlier can be either a data entry error or a wine that has an excessive amount of citric acid (more than the limit of the measuring instrument).

As the value table below shows, the resolution of the measurement has been only 0.01 g/dm3, which means that all wines containing less than that will appear as 0 in the dataset.

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

We can get a better idea about the distribution by eliminating values 0 and 1. We will also adjust the binwidth to correspond to the measurement resolution.
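
A hedged sketch of this filtering step with base R's hist(); the plotting library actually used for the report is not stated. The vector holds the first ten citric acid values from the str() output.

```r
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Keep only values strictly between the two suspicious extremes 0 and 1.
inner <- citric_acid[citric_acid > 0 & citric_acid < 1]

# One bin per 0.01 g/dm3, i.e. per step of the measurement resolution.
h <- hist(inner, breaks = seq(0, 1, by = 0.01), plot = FALSE)

print(inner)
print(sum(h$counts))
```

Setting plot = TRUE (the default) draws the histogram instead of only returning the bin counts.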

This distribution appears artificially uniform towards the lower values (i.e. there is no gradual reduction of the frequency of the lower values). This can probably be explained by the fact that winemakers usually add some citric acid to give the wine a fresh (non-flat) body. Further support for this is the presence of three very prominent peaks in the histogram (at 0.02, 0.24 and 0.49); the last two in particular may indicate that the citric acid amount has been deliberately adjusted towards 0.25 or 0.50. If the citric acid amounts were due only to a natural process (like fermentation, as is the case with the other acids), the distribution would be expected to be more bell-shaped (the central limit theorem suggests an approximately normal distribution for the combined effect of many independent random processes).

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The IQR (0.7) of the residual.sugar values is small compared to the total range (14.6). This signifies that we’re dealing with wines of mostly the same class of sweetness, which is not surprising given that the “Vinho Verde” region produces predominantly very fresh wines.

We can classify the wines in terms of sweetness according to the scale mandated by EU regulation 753/2002. This scale runs like this:

Sugar content [g/dm3]   ≤ 4   (4, 12]      (12, 45]   > 45
Sweetness               Dry   Medium Dry   Medium     Sweet

We create a new ordered factor called sweetness in the original dataframe, with levels corresponding to the sweetness degrees above. As the frequency table below shows, most of the wines are “dry”.
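
One way to build such a factor is base R's cut(); the sketch below is an illustration, not necessarily the code used for this report. The level labels are taken from the frequency table that follows, and the input is the first ten residual sugar values from the str() output.

```r
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Right-closed intervals: (0, 4], (4, 12], (12, 45], (45, Inf).
sweetness <- cut(residual_sugar,
                 breaks = c(0, 4, 12, 45, Inf),
                 labels = c("dry", "medium.dry", "medium", "sweet"),
                 ordered_result = TRUE)

print(table(sweetness))
```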

## 
##        dry medium.dry     medium      sweet 
##       1474        117          8          0

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The amount of NaCl (chlorides) also shows a bell-shaped distribution with a very long right tail. Most values are again concentrated in a small region (IQR = 0.02 compared to a range of 0.599). This can be explained by the fact that all wines come from a geographically constrained region; it is known that the soil type and micro-climatic conditions of the region have a direct influence on the salinity of the wine. Interestingly, some countries limit the amount of chlorides accepted in a wine (e.g. in Brazil it is 0.2 g/dm3, while in Australia it is 0.6 g/dm3). Generally, levels above 0.5 g/dm3 start to give the sensory perception of saltiness (although it depends on the national diet). There is one “outlier” wine in the sample (0.611 g/dm3) that could trigger a “salty” grimace from the taster.

Free SO2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The winemakers use SO2 as an antioxidant and disinfectant. Before bottling the wine they adjust the levels of free SO2 usually between 10 and 40 mg/dm3 (careful producers relate the amount of SO2 to the pH–red wines with low pH need less SO2 than ones with higher pH). This range is also pretty visible in the histogram above.

Total SO2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

In Europe the permitted level of total SO2 is 150 mg/dm3 for dry red wines (200 mg/dm3 for sugar levels above 5 g/dm3). This limit is also visible in the graphs above. Two wines exceed it significantly (theoretically they cannot be sold on the EU market, but they could be sold in the US, where the permitted level is 350 mg/dm3).

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

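The density variable is approximately normally distributed (Shapiro-Wilk W = 0.9908655, p = 1.9360528 × 10^-8). W is close to 1, although with 1599 observations the very small p-value still formally rejects exact normality. The Normal Q-Q plot below shows the same picture: the points follow the line in the centre but indicate somewhat heavier tails (hence the outliers in the boxplot).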

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH variable is also approximately normally distributed (Shapiro-Wilk W = 0.9934863, p = 1.7122373 × 10^-6); again W is close to 1 while the small p-value formally rejects exact normality. The Normal Q-Q plot below shows a somewhat heavier right tail (hence the outliers in the boxplot).

The mean pH lies at 3.3, which is consistent with the observation that the wines of “Vinho Verde” are quite acidic and fresh.
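
For reference, a small sketch of how the Shapiro-Wilk statistic and the Q-Q coordinates can be obtained in base R. It runs on the first ten pH values from the str() output, so its W and p-value will differ from the full-sample figures quoted for density and pH.

```r
pH <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# shapiro.test() returns the W statistic (close to 1 for normal-looking data)
# and the p-value of the null hypothesis that the sample is normal.
sw <- shapiro.test(pH)
print(unname(sw$statistic))
print(sw$p.value)

# Theoretical vs. sample quantiles; plot.it = TRUE would draw the Q-Q plot.
qq <- qqnorm(pH, plot.it = FALSE)
print(head(data.frame(theoretical = qq$x, sample = qq$y)))
```

With large samples the test becomes very sensitive, so a W near 1 and a tiny p-value can occur together; the Q-Q plot is then the more informative diagnostic.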

Sulfates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulfates variable (measuring predominantly the amount of K2SO4) has a bell-shaped right-skewed distribution, with longer right tail.

Normally fresh (not very old) wines contain around 0.4-0.7 g/dm3 of K2SO4 which is very well seen in the histogram (median 0.62).

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol content shows a significant right skew (skewness = 0.8592144). As is common for right-skewed distributions, this is due to a lower limit on the left: European legislation stipulates a minimum alcohol content of 8.5%, and the maximum cannot surpass 15%. The statistics above stay close to this range (8.4% to 14.9%).

Quality

The quality variable is categorical, so we will plot it in a barplot:

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

We see that the majority of wines are in the middle quality range.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1599 observations of red wines from the Portuguese region “Vinho Verde”. Each observation has 11 inputs (chemical measurements) and 1 output (subjective sensory assessment of quality).

All laboratory measurements (input variables) are numeric and have the following meaning (in brackets are the units of measurement):

  • Fixed acidity [g/dm3] - the amount of “good” non-volatile acids (e.g. malic, tartaric)
  • Volatile acidity [g/dm3] - the amount of “bad” acids (mostly acetic acid)
  • Citric acid [g/dm3] - the amount of citric acid
  • Residual sugar [g/dm3] - the amount of sugar left after the fermentation has ended
  • Chlorides (mostly NaCl) [g/dm3] - the amount of salts in the wine
  • Free sulfur dioxide (SO2) [mg/dm3] - the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
  • Total sulfur dioxide (SO2) [mg/dm3] - amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
  • Density [kg/dm3] - the density of wine is close to that of water [1 kg/dm3], depending on the percent alcohol and sugar content
  • pH - describes the acidity of the wine: 0 (very acidic) to 14 (very basic)
  • Sulfates (mostly K2SO4) [g/dm3] - a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant agent.
  • Alcohol [% by volume] - the percentage of alcohol content

The output variable is the factor quality having the following ordinal scale: 0 (lowest) to 10 (highest). The value of this variable is obtained as the median of three subjective sensory tasting assessments performed by different oenologists. The measured wines are in the range 3-8, i.e. no wine was classified as “very bad” or “excellent”.

What is/are the main feature(s) of interest in your dataset?

We’re interested in finding a relationship that can predict the wine quality based on the various chemical measurements. It is not easy to make a priori assumptions as to which chemical variables have a significant effect on the sensory perception of the wine. Still, what comes first to mind is the alcohol content and the pH, both being aspects that are very easily perceived. Of the other chemical substances, the measured chloride content is mostly below the detection threshold. Likewise, sulphur dioxide (aka sulfites) has neither a noticeable smell nor taste at low concentrations, so it is not very likely that it directly determines the tasting assessment. On the other hand, the sulphates can have very different tastes depending on their concentration, as shown on the diagram below.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Citric acid gives freshness to the wine, and as such will probably affect the tasting positively. In contrast, acetic acid (quantified by the volatile.acidity variable) gives an unpleasant, characteristic vinegar taste. We have to keep in mind that higher acidity will correlate with lower pH; therefore it is important to distinguish the case when low pH is due to the “bad” acetic acid.

Did you create any new variables from existing variables in the dataset?

A new categorical variable sweetness was created to classify each wine according to its residual sugar content. It has the following levels: “dry” < “medium dry” < “medium” < “sweet”.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Two distributions are close to normal: pH and density.

Most of the distributions, however, are right-skewed. This often happens when we measure quantities that cannot fall below a given limit (e.g. due to legal requirements, as is the case with the alcohol content). Such distributions can usually benefit from a log transformation, as this brings them closer to normality. An example is given below for the total SO2 variable.

Before the log transformation, the total SO2 variable has a Shapiro-Wilk statistic of W = 0.8732246. After the log transformation, W = 0.9899255, i.e. the transformed variable is closer to a normal distribution. The before-and-after histograms below confirm this:
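
The before-and-after comparison can be sketched as follows. The base of the logarithm used in the report is not stated; log10 is assumed here (the base does not affect W, since changing it only rescales the values). The input is the first ten total SO2 values from the str() output, so the resulting statistics are illustrative only.

```r
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Shapiro-Wilk W on the raw and on the log-transformed values.
w_raw <- unname(shapiro.test(total_so2)$statistic)
w_log <- unname(shapiro.test(log10(total_so2))$statistic)

print(c(raw = w_raw, log10 = w_log))
```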

The most unusual distribution is that of citric acid. For didactic purposes, we will attempt to bring it closer to a normal distribution. One way to do this is the Box-Cox transformation, a type of power transform used frequently in many areas of statistical analysis (see “Box-Cox Normality Transformation” in the Bibliography).

First, we plot the Q-Q normality plot of the original citric acid variable. It shows a rather non-normal distribution.

Then we perform the Box-Cox transformation with a plot of the log-likelihood profile.

Finally we plot the Q-Q plot for the transformed variable, as well as a superposition of the original and transformed densities.

It seems that the Box-Cox transformation in this case does not produce a very convincing result…
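
The steps above can be sketched with boxcox() from the MASS package. This is an illustration under stated assumptions, not the report's actual code: the data are simulated right-skewed values (not the citric acid measurements), and the helper box_cox() is a hypothetical name introduced here. Box-Cox requires strictly positive values, so the zeros in citric.acid would first have to be excluded or shifted; how that was handled in the report is not stated.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated, strictly positive, right-skewed stand-in data (NOT wine data).
x <- rlnorm(500, meanlog = -1.5, sdlog = 0.6)

# Profile log-likelihood of lambda for an intercept-only model;
# plotit = TRUE would draw the log-likelihood profile.
bc <- boxcox(x ~ 1, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Box-Cox transform: log(v) for lambda = 0, (v^lambda - 1) / lambda otherwise.
box_cox <- function(v, lambda) {
  if (abs(lambda) < 1e-8) log(v) else (v^lambda - 1) / lambda
}
x_bc <- box_cox(x, lambda)

print(lambda)
print(unname(shapiro.test(x)$statistic))
print(unname(shapiro.test(x_bc)$statistic))
```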

Bivariate Plots Section

To achieve a quick overview of all bi-variate relations, we’ll produce a scatterplot matrix:

From the boxplots in this matrix we easily see that our variable of interest (quality) has a very pronounced positive correlation with citric acid, sulphates and alcohol. It has a negative correlation with volatile acidity, chlorides, density and pH. Its relation to the rest of the input variables seems less clearly defined.

A better view of the dependence of quality on the input variables can be obtained from the figures below. They plot conditionally the median of each input variable at each level of quality:
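
The conditional medians behind such figures can be computed with aggregate(); this is a sketch of one possible approach, run on the first ten observations from the str() output with two of the input variables.

```r
w <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8,
                   ordered = TRUE)
)

# Median of each input variable within each observed quality level.
med <- aggregate(cbind(pH, alcohol) ~ quality, data = w, FUN = median)
print(med)
```

Quality levels with no observations in the data simply do not appear in the result.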

For an easier identification of the remaining significant correlations, a correlogram can help:

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We see a very clear trend of increasing sensory quality with higher alcohol content. Surprisingly, a similar trend is seen for the sulphates. Volatile acidity correlates negatively with quality, which was expected, given the unpleasant taste of acetic acid. Citric acid, on the other hand, has a positive correlation, which is also not surprising, given its “freshness” sensory effect. An interesting observation is that an increase in the chlorides is associated with lower quality, even if the amounts are below the taste thresholds. The other variables (fixed acidity, SO2 contents, and residual sugar) do not have a clear relationship with the quality perception.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The correlogram shows pretty strong positive correlation (blue color) between:

  • Fixed acidity and citric acid (cor=0.6717034)—not surprising, since the amount of citric acid is already included in the measurement of the total “good” acids content.

  • Fixed acidity and density (cor=0.6680473)—this is due to the fact that the three main acids in wine (citric, malic and tartaric) have higher density than water (1.67, 1.61 and 1.79 g/cm3), therefore their presence increases the overall density of the wine.

  • The two SO2 measurements (cor=0.6676665)—normal, since the free SO2 is included also in the total measurement.

  • Residual sugar and density (cor=0.3552834, a more moderate correlation): as sugar is denser than water (1.587 kg/L vs. 1.000 kg/L), its presence tends to increase the overall density of the wine.

Strong negative correlation (red color) was found between:

  • Fixed acidity and pH (cor=-0.6829782)—as expected, since higher acidity manifests itself in lower pH.

  • The same for citric acid and pH (cor=-0.5419041).

  • Density and alcohol content (cor=-0.4961798): since alcohol has a lower density than water (0.789 kg/L compared to 1.000 kg/L), its presence tends to reduce the overall density (all other factors being equal).
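
The coefficients quoted in these lists appear to be Pearson correlations (assumed here; it is the default of cor()). A minimal sketch of computing the correlation matrix and a 95% confidence interval for one pair, using the first ten observations from the str() output, so the values will differ from the full-sample coefficients:

```r
w <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlation matrix of the numeric columns.
cm <- cor(w)
print(round(cm, 3))

# Point estimate and 95% confidence interval for one pair of variables.
ct <- cor.test(w$fixed.acidity, w$pH)
print(unname(ct$estimate))
print(ct$conf.int)
```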

Somewhat unexpected is the positive correlation between volatile acidity (acetic acid) and pH (normally all acids lower the pH). A possible explanation is the following: the Acetobacter aceti bacteria, responsible for vinegar fermentation and the production of acetic acid, thrive in wines with lower fixed acidity and SO2 levels. Therefore, wines with higher volatile acidity will tend to have lower fixed acidity (as we see in the negative correlation between fixed/citric and volatile acidity below).

This lower fixed acidity probably is not offset by the higher content of volatile acidity (acetic acid is a very weak acid), and therefore the pH tends to rise. This acidity-pH relationship is illustrated in the diagrams below:

What was the strongest relationship you found?

Visually judging from the plots above, the strongest relationship could be the one between the amount of citric acid and quality.

Multivariate Plots Section

An interesting experiment is to superimpose the density plots of the different variables that can be assumed to be predictors, conditional to the quality factor.

Another possibility is to visually explore a given chemical measurement across both quality and sweetness levels. This can be justified since the sweetness is one of the first factors by which people evaluate wines.

One issue with these plots is that there are very few observations including “medium” wines–it would be best probably to completely ignore the visualizations for this factor level.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The multivariate plots confirmed the already observed relationships of some variables to quality. In this case we see clearly from the density plots that the distribution modes of these variables change monotonically (increasing for citric.acid, alcohol and sulphates; decreasing for volatile.acidity) across the increasing levels of quality.

Were there any interesting or surprising interactions between features?

From the density plots we also see that the variances of some of the variables (like citric.acid and sulphates) at different quality levels are very similar. For others (volatile.acidity and alcohol) they are different and, interestingly enough, they differ in opposite ways: the variance of alcohol increases with quality level, while that of volatile.acidity decreases.

What this means is that the lower-quality wines can have very different amounts of acetic acid, which in turn suggests that another, more potent factor comes into play and downgrades the wine. One such factor could be the amount of alcohol: if it is low, the wine tends towards the bad side, even if there is little acetic acid in it.

Conversely, high-quality wines can have very different values of alcohol content. This could be due to another factor that keeps the quality high despite the varying alcohol content. Maybe this is a low amount of acetic acid.

To test this hypothesized correlation between alcohol and acetic acid levels we can plot their scatterplot faceted by the quality level.

Indeed we see that there is significant correlation between alcohol and acetic acid content at the two quality extremes. So, the “very bad” quality wines steadily feature a low alcohol content; similarly, the “excellent” wines mostly have low volatile acidity.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Since the output variable is categorical (with an ordinal scale), we consider an ordered logistic regression approach to modeling the quality behavior. In other words, a method like “lm” or OLS (ordinary least squares) regression, which operates on a continuous dependent variable, is not appropriate here.

We will use the polr function from the MASS package to estimate an ordered logit (ologit) regression model. This function relies on the “proportional odds assumption”, i.e. it assumes that the coefficients describing the relationship between any pair of outcome groups are the same (in a robust study this should be tested; however, it is far beyond the scope of the course).

In choosing the predictor variables for the regression, it is important that they are not strongly correlated; e.g. it is of no use feeding the model with both total.sulfur.dioxide and free.sulfur.dioxide. As the analyses above have shown, the quality output is considerably affected by predictors such as citric.acid, volatile.acidity, sulphates, and alcohol. We will create a number of proportional odds regression models, using different combinations of predictors, and will compare their predictive strengths. The goodness of fit is measured via an ad hoc indicator, predict.rate, which is simply the percentage of correctly predicted quality values out of the total number of observations (another possible measure would be, for example, the mean absolute deviation (MAD)).

The models we will try are the following:

  • quality ~ citric.acid
  • quality ~ citric.acid + volatile.acidity
  • quality ~ citric.acid + volatile.acidity + alcohol
  • quality ~ sqrt(citric.acid) + volatile.acidity + alcohol
  • quality ~ citric.acid + volatile.acidity + log10(alcohol)
  • quality ~ citric.acid + volatile.acidity + alcohol + sulphates
  • quality ~ citric.acid + volatile.acidity + log10(alcohol) + sulphates
  • quality ~ volatile.acidity + alcohol + sulphates
  • quality ~ alcohol + pH
  • quality ~ alcohol + pH + volatile.acidity
  • quality ~ log10(alcohol) + pH + log10(volatile.acidity)

After fitting these models, a summary table will be printed sorted by increasing value of the “prediction rate”.
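
The fitting-and-comparison loop described above can be sketched as follows. This is a hedged illustration, not the report's code: the data frame d is simulated (it is not the wine data), only two of the candidate formulas are fitted, and fit_one() is a hypothetical helper. polr(), predict() with type = "class" and AIC() are the only modelling calls used.

```r
library(MASS)  # provides polr()

set.seed(42)
n <- 400
# Simulated stand-in data (NOT the wine measurements): two predictors and an
# ordinal outcome generated from a latent variable with logistic noise.
d <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = runif(n, min = 0.2, max = 0.9)
)
latent <- 0.9 * (d$alcohol - 10.4) - 3 * (d$volatile.acidity - 0.55) + rlogis(n)
d$quality <- cut(latent, breaks = c(-Inf, -1.5, 0.5, 2.5, Inf),
                 labels = c("4", "5", "6", "7"), ordered_result = TRUE)

formulas <- c("quality ~ alcohol",
              "quality ~ alcohol + volatile.acidity")

# Fit one proportional-odds model and return its prediction rate and AIC.
fit_one <- function(f) {
  fit  <- polr(as.formula(f), data = d, Hess = TRUE)
  pred <- predict(fit, type = "class")  # most probable class per observation
  data.frame(formula      = f,
             predict.rate = mean(as.character(pred) == as.character(d$quality)),
             AIC          = AIC(fit))
}

results <- do.call(rbind, lapply(formulas, fit_one))
results <- results[order(results$predict.rate), ]  # sorted by increasing rate
print(results)
```

Note that predict.rate is computed on the same data used for fitting, so it is an optimistic estimate; a held-out test set or cross-validation would give a fairer comparison.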

According to the table above, the best model (i.e. the one having the best prediction rate) is:

## [1] "quality ~ citric.acid + volatile.acidity + log10(alcohol)"

It also has a relatively low AIC (Akaike Information Criterion), meaning that its trade-off between goodness of fit and model complexity (number of parameters) is among the better ones of the tried models.

The coefficients and other fitting statistics of this model are:

## Call:
## polr(formula = best_model$formula, data = w, Hess = T)
## 
## Coefficients:
##                     Value Std. Error  t value
## citric.acid       0.06291     0.3129   0.2011
## volatile.acidity -4.01343     0.3680 -10.9067
## log10(alcohol)   23.23954     1.3390  17.3565
## 
## Intercepts:
##     Value    Std. Error t value 
## 3|4  15.7292   1.4057    11.1895
## 4|5  17.6521   1.3736    12.8513
## 5|6  21.2355   1.3834    15.3500
## 6|7  23.9430   1.4242    16.8117
## 7|8  26.8746   1.4603    18.4029
## 
## Residual Deviance: 3182.95 
## AIC: 3198.95

We see that the predictive power of these models is far from satisfactory (< 60%). Probably, we can have better results using a higher-order regression model or some machine learning approach—for example, Random Forests or Support Vector Machines (SVM) come to mind.


Final Plots and Summary

Plot One

Description One

Most of the measurements feature right-skewed distributions, some of them with very long but thin tails (residual.sugar, chlorides), others with fat tails (alcohol, sulphates, sulfites, fixed.acidity). Closest to normal are the distributions of pH and density.

The volatile.acidity has right-skewed, probably bi-modal distribution.

Finally, citric acid has what appears to be a tri-modal distribution, sharply cut to the left due to the measurement’s limited precision.

Plot Two

Description Two

The correlation matrix shows visually and numerically the relationships between the different measurement variables. Red is negative correlation, blue is positive. The more saturated a color is, the stronger the correlation. The correlation coefficients are given with their 95% CI. As identified in the section for Bi-variate analysis, there is a physical/chemical causation explaining most of the strong correlations (e.g. higher fixed acidity translates into a lower pH).

Plot Three

Description Three

This final plot shows how the median of the observations for each measured variable behaves at different quality levels. Monotonic plots (like the ones for alcohol, citric acid, volatile acidity, sulphates) directly identify those physicochemical properties that can be used as predictors of the quality output. They can serve as inputs to regression or machine learning models.


Reflection

Finding a way to predict the sensory quality of wines based on their physicochemical properties can fulfill a dream of winemakers and oenologists.

Using data mining techniques on large, freely available datasets of wine measurements can probably achieve this.

We can go even further and imagine models that can predict the origin and variety of the grapes from the chemical measurements, or even the climatic quality of the vintage year.

With the given dataset of red wine observations from the Vinho Verde region, it was already possible to determine some chemical properties that can be used as predictors for the quality output. The most apparent ones are the alcohol content, the volatile acidity, the amount of citric acid and of sulphates. Especially the latter was an unexpected relationship.

An attempt was made to obtain an ordinal proportional-odds regression model, using the thus identified predictors. The goodness-of-fit of the found model was not deemed satisfactory, suggesting that the use of more advanced methods like Random Forests can lead to better results.

Unfortunately the dataset does not include any measurements of tannin content; yet tannins (the natural polyphenols giving the astringent taste of wine) are very important for the sensory perception of wines.

Bibliography

When a wine is salty, and why it shouldn’t be

Chloride concentration in red wines: influence of terroir and grape type

Vinho Verde on Wikipedia

Box-Cox Normality Transformation

Grainger K., Tattersall H., Wine Production and Quality, Wiley, 2016